Abstract: We consider the problem of finite sample corrections for
entropy estimation. New estimates of the Shannon entropy are proposed
and their systematic error (the bias) is computed analytically. We find
that our results cover correction formulas of current entropy estimates
recently discussed in the literature. The trade-off between bias
reduction and the increase of the corresponding statistical error is
analyzed.
Statistical fluctuations of small samples induce both statistical and
systematic deviations of entropy estimates. In the naive ("likelihood")
estimator one replaces the discrete probabilities $p_i$, for
$i = 1, \dots, M$, in the Shannon entropy [1]

$$H = -\sum_{i=1}^{M} p_i \ln p_i \tag{1}$$

by maximum likelihood estimates $\hat p_i$. More precisely, we consider
samples of $N$ observations, and let $n_i$ be the frequency of
realization $i$ in the ensemble. Then, with the choice
$\hat p_i = n_i/N$, the naive estimate

$$\hat H = -\sum_{i=1}^{M} \hat p_i \ln \hat p_i \tag{2}$$

leads to a systematic underestimation of the entropy $H$.
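The underestimation can be checked directly for very small samples. The following is a minimal sketch, not part of the original analysis: it computes the exact expectation of the naive estimate (2) by enumerating all sequences of N draws. The function names and the example distribution `probs` are illustrative assumptions.

```python
import math
from itertools import product

def naive_entropy(counts):
    """Plug-in estimate (2): -sum (n_i/N) ln(n_i/N); empty classes contribute 0."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)

def expected_naive_entropy(probs, N):
    """Exact E[H_hat], enumerating all M**N ordered samples (small N only)."""
    M = len(probs)
    expectation = 0.0
    for seq in product(range(M), repeat=N):
        weight = math.prod(probs[i] for i in seq)      # probability of this sample
        counts = [seq.count(i) for i in range(M)]      # frequencies n_i
        expectation += weight * naive_entropy(counts)
    return expectation

probs = [0.5, 0.3, 0.2]                                # hypothetical example distribution
true_H = -sum(p * math.log(p) for p in probs)
for N in (2, 4, 8):
    print(N, round(expected_naive_entropy(probs, N), 4), "true H:", round(true_H, 4))
```

The enumeration grows as M**N, so it is only meant as a check for a handful of observations.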

There is a series of publications trying to reduce the estimation error
successively with suitable correction terms. One approach is to apply a
Taylor expansion around the probability $p_i$ to the ln-function in (2).
A detailed computation of the expectation value of $\hat H$ with respect
to the multinomial distribution

$$P(n_1,\dots,n_M; p_1,\dots,p_M, N) = N! \prod_{i=1}^{M} \frac{p_i^{n_i}}{n_i!} \tag{3}$$

up to the second order in $1/N$ was given by Harris [3] and gives

$$E[\hat H] = H - \frac{M-1}{2N} + \frac{1}{12N^2}\Bigl(1 - \sum_{i=1}^{M} \frac{1}{p_i}\Bigr) + O(N^{-3}). \tag{4}$$

The $O(1/N)$ correction term was first obtained by Miller [2]. The term
of order $1/N^2$ involves the unknown probabilities $p_i$ and cannot, in
general, be estimated reliably. In particular, it would not be
sufficient to replace them by $\hat p_i$ in this term.
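As an illustration of the first-order correction, the sketch below adds the Miller term (M - 1)/(2N) to the naive estimate. It assumes that the number of classes M is either known or, as a common practical substitute that is not taken from the text, replaced by the number of observed classes; the parameter name `num_classes` is hypothetical.

```python
import math

def naive_entropy(counts):
    """Plug-in estimate: -sum (n_i/N) ln(n_i/N) over non-empty classes."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)

def miller_corrected_entropy(counts, num_classes=None):
    """Naive estimate plus the O(1/N) Miller term (M - 1) / (2N).

    num_classes is M; if omitted, the number of non-empty classes is used
    (an assumption of this sketch, not prescribed by the text).
    """
    total = sum(counts)
    M = num_classes if num_classes is not None else sum(1 for n in counts if n > 0)
    return naive_entropy(counts) + (M - 1) / (2 * total)

print(miller_corrected_entropy([3, 2, 1], num_classes=3))
```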

In order to extend the estimation beyond corrections of order $1/N$,
Paninski [5] applies Bernstein approximating polynomials, which are
defined as a linear combination of binomial polynomials. It can be
shown, using results from approximation theory, that there exist
expansion coefficients such that the maximum (over all $p_i$)
systematic deviations are of the order $1/N^2$. This is better than the
order $1/N$ rate offered by the correction terms mentioned above.
Unfortunately, the good approximation properties of this estimator are
a result of a delicate balancing of large, oscillating coefficients,
and the variance of the corresponding estimator turns out to be very
large. Thus, to find a good estimator, one has to minimize bounds on
bias and variance simultaneously. The result is a regularized
least-squares problem, whose closed-form solution is well known.
However, one can only hope that the solution of the regularized problem
implies a good polynomial approximation of the entropy function. The
latter also depends on whether the experimenter is more interested in
reducing bias than variance, or vice versa.

An alternative approach, where only observables appear in the
correction term, was proposed by Grassberger [6]. There it was assumed
that all $p_i \ll 1$, so that each $n_i$ is a random variable which
should follow a Poisson distribution. To start with, we consider Renyi
entropies of order $q > 0$,

$$H(q) = \frac{1}{1-q} \ln \sum_{i=1}^{M} p_i^q. \tag{5}$$

The Shannon case results from taking the limit $q \to 1$, i.e.
$H = \lim_{q \to 1} H(q)$. For the estimation of $H(q)$ it seems obvious
first to ask for an unbiased estimator of any term $p^q$ of the sum in
(5). In the case of integer values of $q \ge 1$ the situation is trivial
because the unique unbiased estimator $\widehat{p^q}$ is (footnote 1)

$$\widehat{p^q} = \frac{1}{N^q}\,\frac{n!}{(n-q)!}, \qquad n \ge q, \tag{6}$$

with $\widehat{p^q} := 0$ for $n < q$. However, to achieve $q \to 1$, it
is necessary to look first for a generalization for arbitrary $q$. As
shown in [6], the analytical continuation of the estimator is
non-trivial since a naive replacement of the factorials in (6) by
$\Gamma$-functions is biased. Indeed, unbiased estimators of $p^q$ do
not exist for non-integer values of $q$. Nevertheless, in [6] an
interesting estimator of $p^q$ was proposed which is at least
asymptotically unbiased for large $N$, and is also a "good"
approximation in the case of small samples. The corresponding estimator
$\hat H_\psi$ of the Shannon entropy is referred to as (7) in the
following (footnote 2).

For the interesting case of small probabilities $p_i \ll 1$ the
estimate (7) is less biased than the estimator obtained by the Miller
correction.

A further improvement, related to the latter approach, which is also
based on the assumption of Poisson-distributed frequencies, was
recently proposed by Grassberger [7]. The corresponding estimator of
the Shannon entropy is

$$\hat H_G = \sum_{i=1}^{M} \frac{n_i}{N}\Bigl(\psi(N) - \psi(n_i) - (-1)^{n_i} \int_0^1 \frac{t^{n_i-1}}{1+t}\,dt\Bigr). \tag{8}$$

The correction term of the earlier estimator $\hat H_\psi$ is recovered
by a series expansion of the integrand in (8) up to the second order.
The higher-order terms of the integrand lead to successive bias
reductions compared to (7).

Footnote 1: For simplification the index $i$ will be omitted.

Footnote 2: The summation is defined for all $n_i > 0$. The digamma
function $\psi(n)$ is the logarithmic derivative of the
$\Gamma$-function.

At this point, one might ask whether further improvements in bias
reduction are possible. Moreover, it is of special interest to consider
the trade-off between bias reduction and the increase of the
corresponding statistical error. In the following theorem, we propose a
family of new entropy estimators and determine their systematic error
analytically. We will present a detailed analysis of the bias and show
that the entropy estimators above are specific examples of our general
results.

In view of the following computations we note that the Shannon entropy
is a sum of terms, $h(p_i) = -p_i \ln p_i$, which exclusively depend on
the class $i$, for $i = 1, \dots, M$. Therefore, when we consider
expectation values with respect to $n_i$, the computations can be
carried out by replacing the joint distribution (3) by the binomial
distribution

$$P(n_i; p_i, N) = \binom{N}{n_i}\, p_i^{n_i} (1-p_i)^{N-n_i}, \tag{9}$$

for $0 \le n_i \le N$ and $E[n_i] = p_i N$. Now let us consider the
following

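Theorem: Let $\xi > 0$ be a real number and

$$\hat h(\xi, n) = \frac{n}{N}\Bigl(\psi(N) - \psi(n) - (-1)^n \int_0^{1/\xi - 1} \frac{t^{n-1}}{1+t}\,dt\Bigr) \tag{10}$$

be a parametric family of estimators of the function
$h(p) = -p \ln p$. For the particular case $n = 0$, let
$\hat h(\xi, 0) = 0$. Then, we have the identity

$$E[\hat h(\xi, n)] = -p \ln p + b(\xi, p), \tag{11}$$

and

$$b(\xi, p) = -p \int_0^{1 - p/\xi} \frac{t^{N-1}}{1-t}\,dt \tag{12}$$

is the bias of the estimator $\hat h(\xi, n)$.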
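The identity (11) can be checked numerically for small N. The sketch below is an illustration under stated assumptions rather than the author's implementation: for integer arguments it uses psi(N) - psi(n) = 1/n + ... + 1/(N-1), and it evaluates both integrals by elementary recurrences (I_1 = ln(1+a), I_{k+1} = a^k/k - I_k for (10); J_1 = -ln(1-c), J_{k+1} = J_k - c^k/k for (12)), which are adequate for small N only. All function names are hypothetical.

```python
import math

def integral_plus(n, a):
    """I_n(a) = int_0^a t^(n-1)/(1+t) dt via I_1 = ln(1+a), I_{k+1} = a^k/k - I_k."""
    value = math.log1p(a)
    for k in range(1, n):
        value = a ** k / k - value
    return value

def h_hat(xi, n, N):
    """Estimator (10) of -p ln p; defined as 0 for n = 0."""
    if n == 0:
        return 0.0
    digamma_diff = sum(1.0 / k for k in range(n, N))   # psi(N) - psi(n), integers
    return (n / N) * (digamma_diff - (-1) ** n * integral_plus(n, 1.0 / xi - 1.0))

def bias(xi, p, N):
    """Bias (12): -p * int_0^c t^(N-1)/(1-t) dt with c = 1 - p/xi."""
    c = 1.0 - p / xi
    value = -math.log1p(-c)                            # J_1 = -ln(1 - c)
    for k in range(1, N):
        value -= c ** k / k                            # J_{k+1} = J_k - c^k/k
    return -p * value

def expected_h_hat(xi, p, N):
    """Exact expectation of h_hat under the binomial distribution (9)."""
    return sum(math.comb(N, n) * p ** n * (1 - p) ** (N - n) * h_hat(xi, n, N)
               for n in range(N + 1))

def entropy_estimate(counts, xi):
    """Entropy estimate obtained by summing (10) over all classes."""
    N = sum(counts)
    return sum(h_hat(xi, n, N) for n in counts)

N, p = 6, 0.07
for xi in (1.0, math.exp(-0.5), 0.5):
    lhs = expected_h_hat(xi, p, N)
    rhs = -p * math.log(p) + bias(xi, p, N)
    print(f"xi={xi:.4f}  E[h_hat]={lhs:.10f}  -p ln p + b={rhs:.10f}")
```

Both columns should agree up to rounding. For large n and small xi the alternating recurrence loses precision, so a production implementation would need a more careful evaluation of the integral.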
From the theorem we directly obtain an estimator of the Shannon entropy
by summation of (10), i.e. $\hat H_\xi = \sum_{i=1}^{M} \hat h(\xi, n_i)$.
Using a similar notation as in [7] we receive the following expression

$$\hat H_\xi = \psi(N) - \frac{1}{N} \sum_{i=1}^{M} n_i\, S_{n_i}(\xi) \tag{13}$$

with

$$S_n(\xi) = \psi(n) + (-1)^n \int_0^{1/\xi - 1} \frac{t^{n-1}}{1+t}\,dt. \tag{14}$$

Proof: For real $q > 0$ we consider the finite Taylor series
approximation of $p^q$ around $\xi > 0$, i.e.

$$T_N(p) = \sum_{\nu=0}^{N} \binom{q}{\nu}\, \xi^{q-\nu} (p - \xi)^\nu, \tag{15}$$

with

$$\binom{q}{\nu} = \prod_{k=1}^{\nu} \frac{q-k+1}{k}. \tag{16}$$

We expand the brackets on the right hand side of (15) and rearrange the
terms in order to obtain the following double summation

$$T_N(p) = \sum_{k=0}^{N} (-p)^k\, \xi^{q-k} \sum_{\nu=k}^{N} \binom{q}{\nu} \binom{\nu}{k} (-1)^\nu. \tag{17}$$

For simplification we introduce the substitution

$$F(\xi, q, k, N) = (-1)^k\, \xi^{q-k} \sum_{\nu=k}^{N} \binom{q}{\nu} \binom{\nu}{k} (-1)^\nu. \tag{18}$$

Then, by further algebraic manipulations we obtain the identity

$$T_N(p) = (1-p)^N \sum_{n=0}^{N} \sum_{k=0}^{n} F(\xi, q, k, N) \binom{N-k}{n-k}\, \theta^n \tag{19}$$

with $\theta = p/(1-p)$. The rhs of the latter expression is a
polynomial in $\theta$ whose $(N+1)$ coefficients are all independent of
the probability $p$. On the other hand, there is an unbiased estimator,
say $\hat S_q(\xi, n)$, of the expansion $T_N(p)$, since the expectation
value of $\hat S_q(\xi, n)$ can also be expressed by a polynomial of
finite order in $\theta$, i.e.

$$E[\hat S_q(\xi, n)] = (1-p)^N \sum_{n=0}^{N} \binom{N}{n}\, \hat S_q(\xi, n)\, \theta^n. \tag{20}$$

To obtain the explicit expression of $\hat S_q(\xi, n)$, we consider the
necessary and sufficient condition for unbiasedness

$$E[\hat S_q(\xi, n)] = T_N(p). \tag{21}$$

After inserting (19) and (20) into the condition (21) and then
comparing coefficients, it follows

$$\hat S_q(\xi, n) = \binom{N}{n}^{-1} \sum_{k=0}^{n} F(\xi, q, k, N) \binom{N-k}{n-k}. \tag{22}$$

This unbiased estimator of $T_N(p)$ is unique because the identity (21)
is satisfied for arbitrary $\theta$. Next we differentiate
$\hat S_q(\xi, n)$ with respect to $q$ and consider the limit
$q \to 1$. For this purpose we note that the derivative of the binomial
coefficient (16) is

$$\frac{d}{dq} \binom{q}{\nu} \Big|_{q=1} = \begin{cases} \nu, & \nu = 0, 1, \\ \dfrac{(-1)^\nu}{\nu(\nu-1)}, & \nu \ge 2. \end{cases} \tag{23}$$

By direct computation it follows that the negative derivative of the
estimator $\hat S_q$, for $q \to 1$, is given by the expression

$$-\frac{d\hat S_q}{dq}\Big|_{q \to 1} = \frac{\xi}{N}\Bigl(1 - \frac{1}{\xi}\Bigr)^n + \frac{n}{N}\Bigl(\psi(N) - \psi(n) - (-1)^n \int_0^{1/\xi - 1} \frac{t^{n-1}}{1+t}\,dt\Bigr). \tag{24}$$

On the other hand, when applying the same procedure to the Taylor
series expansion $T_N(p)$, we find

$$-\frac{dT_N}{dq}\Big|_{q \to 1} = -p \ln p + \frac{\xi}{N}\Bigl(1 - \frac{p}{\xi}\Bigr)^N - p \int_0^{1 - p/\xi} \frac{t^{N-1}}{1-t}\,dt. \tag{25}$$

Equating both by using (21) and applying the trivial identity
$E[(1 - 1/\xi)^n] = (1 - p/\xi)^N$, by using the notation of the
theorem, we obtain the result

$$E[\hat h(\xi, n)] = -p \ln p + b(\xi, p). \tag{26}$$

Thus, the claim has been proven. Finally, we consider the residual
term, $R_{N+1}$, of the Taylor series expansion $T_N(p)$. By
definition, the identity $p^q = T_N(p) + R_{N+1}(p)$ is valid. Using the
latter and differentiating with respect to $q$ at $q \to 1$ as in (25),
we find the following relation between the bias and the first
derivative of the residual term

$$b(\xi, p) = \frac{dR_{N+1}}{dq}\Big|_{q \to 1} - \frac{\xi}{N}\Bigl(1 - \frac{p}{\xi}\Bigr)^N. \tag{27}$$

□

Figure 1: Systematic error of $\hat h(\xi, n)$ for samples of $N = 6$
observations and $p = 0.07$. Several special cases of $\xi$ are shown.
The case $\xi = e^{-1/2}$ slightly improves the estimator (7) (see the
dot within the circle).

Figure 2: Numerical computation of the statistical error
$\sigma(\xi, p)$ for samples of $N = 6$ observations and $p = 0.07$.
The minimum of $\sigma(\xi, p)$ is obtained by the solution of
$E[\hat h\, \partial \hat h / \partial \xi] = h\, \partial b / \partial \xi$.
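The bias and the statistical error for the setting used in the figures (N = 6, p = 0.07) can be tabulated with a few lines of code. This is a minimal sketch, assuming that the statistical error is the root of the mean square deviation E[(h_hat - h)^2] taken over the binomial distribution (9); the function names and the selection of xi values are illustrative, and the integral is evaluated by a simple recurrence that is only suitable for small N.

```python
import math

def integral_plus(n, a):
    """int_0^a t^(n-1)/(1+t) dt via I_1 = ln(1+a), I_{k+1} = a^k/k - I_k."""
    value = math.log1p(a)
    for k in range(1, n):
        value = a ** k / k - value
    return value

def h_hat(xi, n, N):
    """Estimator (10) of -p ln p, with h_hat = 0 for n = 0."""
    if n == 0:
        return 0.0
    digamma_diff = sum(1.0 / k for k in range(n, N))   # psi(N) - psi(n)
    return (n / N) * (digamma_diff - (-1) ** n * integral_plus(n, 1.0 / xi - 1.0))

def bias_and_sigma(xi, p, N):
    """Return (bias, root mean square error) of h_hat under the binomial law."""
    h_true = -p * math.log(p)
    mean = 0.0
    mse = 0.0
    for n in range(N + 1):
        weight = math.comb(N, n) * p ** n * (1 - p) ** (N - n)
        estimate = h_hat(xi, n, N)
        mean += weight * estimate
        mse += weight * (estimate - h_true) ** 2
    return mean - h_true, math.sqrt(mse)

N, p = 6, 0.07
cases = [("xi = 1", 1.0), ("xi = exp(-1/2)", math.exp(-0.5)),
         ("xi = 1/2", 0.5), ("xi = p", p)]
for label, xi in cases:
    b, sigma = bias_and_sigma(xi, p, N)
    print(f"{label:>15}: bias = {b:+.5f}, sigma = {sigma:.5f}")
```

Scanning xi on a finer grid with the same function gives the kind of bias versus statistical error curve that the figures describe.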

Every point on the continuous line in Fig. 1 is the bias of the
corresponding estimator $\hat h(\xi, n)$. It is unbiased for
$\xi = p$, and there is a turning point for $\xi = pN$. The estimator
is asymptotically unbiased, i.e. $b(\xi, p) \to 0$ for $N \to \infty$,
if $\xi > p/2$. On the other hand, in Fig. 2 we see the mean square
error (statistical error)
$\sigma^2(\xi, p) = E[(\hat h(\xi, n) - h(p))^2]$. The trade-off between
bias and the statistical error of the estimator is shown in Fig. 3.
Typically, one is more interested in the error of the entire sum over
the states $i$ in Eq. (1). If there are $M$ terms, and if each is
roughly of the same order of magnitude, then the total bias and the
total variance are both $\sim M$, thus the statistical deviation
increases only as $M^{1/2}$. Therefore, the more terms one has (the
larger $M$), the more one is interested in using small values of
$\xi > p/2$, if one wants the total statistical and the total
systematic deviations to have the same size. Thus, the interesting
estimators lie between both extremes, i.e. the case of minimum
statistical error and the case $\xi = p$ with vanishing bias. The
following particular cases are especially interesting to focus on:

$\xi = 1$: In this case we obtain the trivial estimator for $h(p)$

$$\hat h(1, n) = \frac{n}{N}\bigl(\psi(N) - \psi(n)\bigr), \tag{28}$$

and $\hat h(1, 0) = 0$ for $n = 0$. By the identity (11) we receive the
following expectation value

$$E[\hat h(1, n)] = -p \ln p - p \int_0^{1-p} \frac{t^{N-1}}{1-t}\,dt. \tag{29}$$

The latter expression has been recently mentioned in [7] (citation [14]
in it). In the asymptotic regime $n \gg 1$ it leads to the Miller
correction (footnote 3), i.e.
$\hat h(1, n) = -\frac{n}{N} \ln \frac{n}{N} + \frac{1}{2N} + O(1/N^2)$.
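The step from the digamma form to the Miller term can be made explicit. The following is a sketch that uses only the asymptotic relation $\psi(x) \approx \ln x - 1/(2x)$ from footnote 3 and writes $\hat p = n/N$ (a shorthand introduced here); higher-order terms in $1/n$ and $1/N$ are dropped.

```latex
\begin{align*}
\hat h(1,n) &= \frac{n}{N}\bigl(\psi(N) - \psi(n)\bigr) \\
            &\approx \frac{n}{N}\Bigl(\ln N - \frac{1}{2N} - \ln n + \frac{1}{2n}\Bigr) \\
            &= -\hat p \ln \hat p + \frac{1}{2N} - \frac{\hat p}{2N}.
\end{align*}
```

Summing over $M$ well-populated classes and using $\sum_i \hat p_i = 1$ gives the naive estimate plus $(M-1)/(2N)$, which is the $O(1/N)$ term attributed to Miller. For small $\hat p$ the last term is negligible against $1/(2N)$.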

$\xi = e^{-1/2}$: The Grassberger estimator (7) is a special case,
since it is not exactly covered by our theorem. However, it can be
very well approximated if the Taylor expansion is chosen around the
particular value $\xi = e^{-1/2}$. By numerical analysis we verified
that the corresponding estimator $\hat h(e^{-1/2}, n)$ is less biased
than the estimator (7), for any $N > 1$ and arbitrary $p$. In Fig. 1 we
see that there is almost no difference between both estimators.
However, by numerical verification, slight improvements become visible
for larger probabilities (e.g. $p > 0.8$). In the case of a single
observation, i.e. $N = 1$, there is no difference between the two, for
any $p$.

Footnote 3: This is because in the asymptotic regime we have the
relation $\psi(x) \sim \ln(x) - 1/(2x)$.

[Figure 3 plot: $b(\xi, p)$ vs. $\sigma(\xi, p)$. Legend:
Miller/Harris/Herzel; Grassberger 1988; $\xi = 1/\exp(0.5)$;
$\xi = 1/2$ (Grassberger 2003); $\xi = pN$ (turning point of
$b(\xi, p)$); minimum $\sigma(\xi, p)$; $\xi = p$ (zero bias). Axis
label: statistical error.]

Figure 3: Trade-off between bias and statistical error for samples of
$N = 6$ and $p = 0.07$. The reduction of the bias for $\xi < pN$ is
related to a strongly increasing statistical error. On the other hand,
the bias corresponding to the minimum statistical error is larger than
that of all the above mentioned estimators.

$\xi = 1/2$: This case is identical to the Grassberger estimator (8),
see [7]. As shown in Fig. 1 it is less biased than the Miller
estimator and the estimator $\hat h(e^{-1/2}, n)$. But the statistical
error of $\hat h(1/2, n)$ is slightly bigger, as we can see in Fig. 3.
In the left half of the unit interval, i.e. $\xi < 1/2$, we obtain a
further reduction of the bias. But now one has to be careful, since we
have the limit $|\hat h(\xi, n)| \to \infty$ for $n \to \infty$.
Although $n$ is always finite in practice, this behavior is an
indication that the statistical error of all estimators with
$\xi \ll 1$ could increase very fast. The dramatic increase of
$\sigma(\xi, p)$ for $\xi \to 0$ is shown in Fig. 2. Therefore, the
particular choice $\xi = 1/2$ seems to be very suitable for estimation
because it is the smallest $\xi$ with $\hat h \to h(p)$ for
$n \to \infty$ and any $p \in (0, 1]$. On the other hand, the most
conservative case is given by the minimum variance estimator (see
Fig. 3). In this case the value of the statistical error and the
absolute value of the bias are comparable. A compromise between both
extremes might be the estimator for $\xi = e^{-1/2} \approx 0.6$. This
case is less biased than the minimum variance estimator, and less
risky than the Grassberger estimator $\hat H_G$.



To sum up, in the above analysis, we see that it is not possible to
decide which of the many estimators $\hat h(\xi, n)$ should be
generally preferred. A good choice of the parameter $\xi$ always
depends on the special application under consideration and the
individual preference of the scientist.